Measuring Similarity of Graphs and their Nodes by 
Neighbor Matching 

Mladen Nikolic* 

Faculty of Mathematics, University of Belgrade, Studentski Trg 16, Belgrade, Serbia 



Abstract 

The problem of measuring similarity of graphs and their nodes is important in 
a range of practical problems. There are a number of proposed measures, some of
them being based on iterative calculation of similarity between two graphs and 
the principle that two nodes are as similar as their neighbors are. In our work, 
we propose one novel method of that sort, with a refined concept of similarity of 
two nodes that involves matching of their neighbors. We prove convergence of 
the proposed method and show that it has some additional desirable properties 
that, to our knowledge, the existing methods lack. We illustrate the method on 
two specific problems and empirically compare it to other methods. 

1. Introduction

Many or most data analysis techniques are designed for data that are repre- 
sented by vectors of numbers. However, this kind of representation often leads 
to loss of structural information contained in the original data, while preserving 
structural information may be essential in some applications. This requires a 
richer problem representation and corresponding data analysis techniques. For 
example, in many practical domains, structural information in the data can be 
represented using graphs. 

Similarity measures between objects are of central importance for various 
data analysis techniques. The same holds for the special case of similarity 
between graphs and a number of measures for this purpose have been proposed. 
In this paper, we focus on iterative methods relying on the principle that the 
nodes of two graphs are as similar as their neighbors in respective graphs are 
[1, 2, 3, 4]. These methods have been successfully applied in several domains,
such as adequate ranking of query results [1], synonym extraction [3], database
structure matching [5], construction of phylogenetic trees [2], analysis of social
networks [5], etc.

In this paper, we try to identify desirable properties not present in the exist- 
ing methods for measuring similarities of graph nodes. We propose a refinement 
of the notion of similarity of two nodes which leads to a new method for measur- 
ing similarities of graph nodes and similarities of graphs. We prove convergence 
of the proposed method and show that it has some additional desirable proper- 
ties that, to our knowledge, the existing methods lack. 

We implemented the proposed method and evaluated it on two problems in 
order to illustrate that our method can capture the notion of similarity useful 
in practical problems. The first test problem was finding a subgraph of a graph 
that is isomorphic to some other given graph. The second test problem was the 
classification of Boolean formulae based on their underlying graph structure. 

The rest of the paper is organized as follows. In Section 2 we present the
preliminaries used in this paper. Existing methods are described and analyzed
in Section 3. In Section 4 we present our new method, the method of neighbor
matching, and prove its properties. Results of experimental evaluation and
comparison to other methods are given in Section 5. In Section 6 we draw final
conclusions and give some directions for future work.

2. Preliminaries 

A directed graph $G = (V, E)$ is defined by its set of nodes $V$ and its set of
edges $E$. There is an edge between two nodes $i$ and $j$ if $(i, j) \in E$. For the
edge $e = (i, j)$, the source node is the node $i$, and the terminating node is the
node $j$. We denote them respectively by $s(e)$ and $t(e)$. We say that the node
$i$ is an in-neighbor of the node $j$ and that the node $j$ is an out-neighbor of the
node $i$ if $(i, j) \in E$. The in-degree $id(i)$ of the node $i$ is the number of
in-neighbors of $i$, and the out-degree $od(i)$ of the node $i$ is the number of
out-neighbors of $i$. The degree $d(i)$ of the node $i$ is the sum of the in-degree and
the out-degree of $i$. Two graphs $G_A = (V_A, E_A)$ and $G_B = (V_B, E_B)$
are isomorphic if there exists a bijection $f : V_A \to V_B$ such that $(i, j) \in E_A$ if
and only if $(f(i), f(j)) \in E_B$. An isomorphism of a graph $G$ to itself is called
an automorphism. A colored graph is a graph in which each node is assigned a
color. For colored graphs, the definition of isomorphism additionally requires
that nodes $i$ and $f(i)$ have the same color. A random Erdős-Rényi graph $G_{n,p}$
is a graph with $n$ nodes in which each two nodes share an edge with probability
$p$ [7]. A graph $G_B$ is an induced subgraph of a graph $G_A$ if $V_B \subseteq V_A$ and for
each pair of nodes $i, j \in V_B$ it holds that $(i, j) \in E_B$ if and only if $(i, j) \in E_A$.
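
As a small illustration of the definitions above, the following Python sketch stores a directed graph and exposes in-neighbors, out-neighbors and degrees. The class `DiGraph` and its method names are hypothetical and chosen only for this example; they are not taken from the implementation described in the paper.

```python
from collections import defaultdict


class DiGraph:
    """Minimal directed graph: nodes V and edges E given as (source, target) pairs."""

    def __init__(self, nodes, edges):
        self.nodes = list(nodes)
        self.edges = set(edges)
        self._in = defaultdict(set)    # j -> set of in-neighbors of j
        self._out = defaultdict(set)   # i -> set of out-neighbors of i
        for i, j in self.edges:
            self._out[i].add(j)
            self._in[j].add(i)

    def in_neighbors(self, j):
        return sorted(self._in[j])

    def out_neighbors(self, i):
        return sorted(self._out[i])

    def in_degree(self, i):            # id(i)
        return len(self._in[i])

    def out_degree(self, i):           # od(i)
        return len(self._out[i])

    def degree(self, i):               # d(i) = id(i) + od(i)
        return self.in_degree(i) + self.out_degree(i)


# Example: edges 1 -> 2, 1 -> 3, 3 -> 2
g = DiGraph([1, 2, 3], [(1, 2), (1, 3), (3, 2)])
```

In this example, node 2 has in-neighbors 1 and 3 and no out-neighbors, so its in-degree is 2 and its out-degree is 0.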

The similarity measure $s$ is a function $s : D_1 \times D_2 \to \mathbb{R}$, where $D_1$ and $D_2$
are possibly equal sets of objects. A higher value of the similarity measure should
imply a higher similarity in some intuitive sense. The choice of a similarity measure
to be used in some context is often guided by its usefulness in practice.

A similarity measure over the nodes of two graphs can be represented by a
similarity matrix $X = [x_{ij}]$ of dimension $|V_A| \times |V_B|$, with the element $x_{ij}$
denoting the similarity of the nodes $i \in V_A$ and $j \in V_B$.

Let $A$ and $B$ be two finite sets of arbitrary elements. A matching of elements
of sets $A$ and $B$ is a set of pairs $M \subseteq \{(i, j) \mid i \in A, j \in B\}$ such that no element of
one set is paired with more than one element of the other set. For the matching
$M$ we define enumeration functions $f : \{1, 2, \ldots, k\} \to A$ and $g : \{1, 2, \ldots, k\} \to B$
such that $M = \{(f(l), g(l)) \mid l = 1, 2, \ldots, k\}$, where $k = |M|$. Let $w(a, b)$ be a
function assigning weights to pairs of elements $a \in A$ and $b \in B$. The goal of
the assignment problem is to find a matching of elements of $A$ and $B$ with the
highest sum of weights (if the two sets are of different cardinalities, some elements
of the larger set will not have corresponding elements in the smaller set). The
assignment problem is usually solved by the well-known Hungarian algorithm
of complexity $O(mn^2)$, where $m = \max(|A|, |B|)$ and $n = \min(|A|, |B|)$ [8].
There are more efficient algorithms, such as the one due to Edmonds and Karp of
complexity $O(mn \log n)$ [9] and an even more efficient one, due to Fredman and
Tarjan, of complexity $O(mn + n^2 \log n)$ [10].
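
To make the assignment problem concrete, here is a minimal Python sketch that finds a maximum-weight matching by brute force. It is not the Hungarian algorithm and is exponential in the set sizes, so it is only meant for very small sets; the function name `best_matching` is hypothetical. It assumes non-negative weights, so that matching every element of the smaller set is optimal.

```python
from itertools import permutations


def best_matching(A, B, w):
    """Return (weight, pairs) of a maximum-weight matching of A and B.

    Brute force over all injections of the smaller set into the larger one.
    Each pair is reported as (element of A, element of B).
    """
    A, B = list(A), list(B)
    swapped = len(A) > len(B)
    small, large = (B, A) if swapped else (A, B)
    best_weight, best_pairs = 0.0, []
    for image in permutations(large, len(small)):
        pairs = [(l, s) if swapped else (s, l) for s, l in zip(small, image)]
        weight = sum(w(a, b) for a, b in pairs)
        if not best_pairs or weight > best_weight:
            best_weight, best_pairs = weight, pairs
    return best_weight, best_pairs


# Example weights: the greedy choice ("a", "x") is not part of the optimum.
weights = {("a", "x"): 0.9, ("a", "y"): 0.8,
           ("b", "x"): 0.7, ("b", "y"): 0.1,
           ("c", "x"): 0.2, ("c", "y"): 0.3}
total, pairs = best_matching(["a", "b", "c"], ["x", "y"],
                             lambda a, b: weights[(a, b)])
```

Here the optimal matching pairs "b" with "x" and "a" with "y" (total weight 1.5), while "c" stays unmatched. For larger sets, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the natural replacement.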

3. Existing Methods for Measuring Graph Node Similarity 

In this section we briefly describe relevant iterative methods for measuring 
similarity of graph nodes and we try to identify some desirable properties that 
they lack. 

Assume that two directed graphs $G_A = (V_A, E_A)$ and $G_B = (V_B, E_B)$ are
given. Iterative methods calculate similarity of nodes of these two graphs by
repeatedly refining the initial estimate of similarity using some update rule
of the form $[x_{ij}^{k+1}] \leftarrow f([x_{ij}^{k}])$. Iterations are performed until some termination
condition is met. At the end, the similarity matrix $X = [x_{ij}]$ is produced.
Different update rules for the similarity of two nodes have been proposed. They
usually include summing all the similarities between the neighbors of the first node
and the neighbors of the second node.

One of the first influential iterative approaches is due to Kleinberg [1], further
generalized by Blondel et al. [3]. In the method of Blondel et al., the update
rule for $x_{ij}$ in step $k + 1$ is given by

\[
x_{ij}^{k+1} \leftarrow \sum_{(p,i) \in E_A,\ (q,j) \in E_B} x_{pq}^{k} + \sum_{(i,p) \in E_A,\ (j,q) \in E_B} x_{pq}^{k}.
\]

The similarity matrix $X$ is normalized by $X \leftarrow X / \|X\|_2$ after each step.
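
The update rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `blondel_step` and the dictionary representation of $X$ are assumptions, and the norm is taken here as the entrywise (Frobenius) norm, which is one reading of the normalization step.

```python
import math


def blondel_step(X, edges_a, edges_b):
    """One summation-based update followed by normalization.

    X: dict {(i, j): score} over all node pairs of the two graphs.
    edges_a, edges_b: iterables of (source, target) pairs.
    """
    new = {key: 0.0 for key in X}
    for (p, i) in edges_a:
        for (q, j) in edges_b:
            new[(i, j)] += X[(p, q)]   # p, q are in-neighbors of i, j
            new[(p, q)] += X[(i, j)]   # i, j are out-neighbors of p, q
    # Assumption: entrywise (Frobenius) norm used for the normalization.
    norm = math.sqrt(sum(v * v for v in new.values()))
    return {key: v / norm for key, v in new.items()} if norm > 0 else new


# Example: both graphs consist of the single edge 1 -> 2.
X0 = {(i, j): 1.0 for i in (1, 2) for j in (1, 2)}
X1 = blondel_step(X0, [(1, 2)], [(1, 2)])
```

Note how the normalization couples all scores: in this example the two nonzero entries end up at $1/\sqrt{2}$ rather than at 1, even though the graphs are identical.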

The earlier approach by Melnik et al. [5] can be seen as a more general
version of this method where the similarities between neighbor nodes $x_{pq}^{k}$ are
weighted.

The method of Blondel et al. was modified by Zager and Verghese [3] to take
into account similarity of the edges too. The update rule for the edge similarity
matrix $Y = [y_{uv}]$, where $u \in E_A$ and $v \in E_B$, is given by

\[
y_{uv}^{k+1} \leftarrow x_{s(u)s(v)}^{k} + x_{t(u)t(v)}^{k}.
\]

The update rule for similarity of nodes is then given in terms of similarities of
the edges:

\[
x_{ij}^{k+1} \leftarrow \sum_{t(u)=i,\ t(v)=j} y_{uv}^{k} + \sum_{s(u)=i,\ s(v)=j} y_{uv}^{k}.
\]

Matrix normalization of the similarity scores is applied in this approach too.
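
A minimal Python sketch of one round of this coupled update follows. The names `coupled_step` and `_normalize` are hypothetical, both matrices are stored as dictionaries, and the entrywise (Frobenius) norm is assumed for the normalization; the original method may differ in these details.

```python
import math


def _normalize(M):
    # Assumption: entrywise (Frobenius) norm.
    norm = math.sqrt(sum(v * v for v in M.values()))
    return {k: v / norm for k, v in M.items()} if norm > 0 else dict(M)


def coupled_step(X, Y, edges_a, edges_b):
    """One simultaneous update of node scores X and edge scores Y.

    X: {(i, j): score} for node pairs; Y: {(u, v): score} for edge pairs,
    where each edge is a (source, target) tuple.
    """
    # Edge scores from the previous node scores of sources and targets.
    Y_new = {(u, v): X[(u[0], v[0])] + X[(u[1], v[1])]
             for u in edges_a for v in edges_b}
    # Node scores from the previous edge scores.
    X_new = {key: 0.0 for key in X}
    for (u, v), y in Y.items():
        X_new[(u[1], v[1])] += y   # t(u) = i, t(v) = j
        X_new[(u[0], v[0])] += y   # s(u) = i, s(v) = j
    return _normalize(X_new), _normalize(Y_new)


# Example: both graphs consist of the single edge 1 -> 2.
e = [(1, 2)]
X0 = {(i, j): 1.0 for i in (1, 2) for j in (1, 2)}
Y0 = {((1, 2), (1, 2)): 1.0}
X1, Y1 = coupled_step(X0, Y0, e, e)
```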

The approach by Heymans and Singh [5] is somewhat different and more 
complex than the described methods, and we only briefly mention its most 
important aspects. In order to estimate similarity in each iteration, similarity 
terms and dissimilarity terms are calculated, based on the similarity scores of 
the previous iteration. These terms average the similarities of the in-neighbors
and the similarities of the out-neighbors. Similarity terms are calculated both for
the original graphs and their complements. Dissimilarity terms are calculated
using one graph and the complement of the other, and vice versa. Dissimilarity
terms are subtracted from similarity terms to obtain a new estimate of the similarity
scores. The matrix normalization is performed after each iteration.

There are approaches that are designed for measuring similarity between the 
nodes of the same graph [TTJ |S] . We don't discuss these methods as they are 
less general than the former ones. 

The described methods lack some desirable and natural properties. Of
course, not all the methods lack all the listed properties.

If the graph is compared to itself, each node should be most similar to itself.
This is a natural property, expected for all similarity measures. Nevertheless, 
for all mentioned methods it is easy to construct graphs for which there is a 
node which is more similar to some other node of the same graph than to itself. 
This can easily occur, for instance, in methods where the update rule consists 
of simple summation of similarities of neighbor nodes. This results in nodes of 
higher degree having more terms in the summation and hence, higher similarity 
with other nodes [12] . 

Similarity scores should have a fixed range with similarity of a node to itself
always taking the maximal value. It is customary for similarity measures in general
(not only for similarity measures for graphs) to have a fixed range (e.g., from
0 to 1 or from -1 to 1). Without loss of generality, we will assume the range
[0, 1]. Also, the similarity of each object to itself should be 1. These properties
facilitate intuitive understanding of similarity scores. Well-known examples of
measures for which these requirements are fulfilled are cosine, correlation
coefficient, Jaccard coefficient, etc. However, the mentioned methods for calculating
graph node similarity lack this property. When the similarity scores are
calculated for the nodes of the same graph, the similarity score of one node compared
to itself can be different from the similarity score of some other node compared
to itself. So, one node can be more similar to itself than another.

It is reasonable to make an even stricter requirement: if two graphs $G_A$ and
$G_B$ are isomorphic, with isomorphism $f : V_A \to V_B$, the similarity score $x_{i f(i)}$
should be 1 for all $i \in V_A$.

Similarity scores should be meaningful in absolute terms. Due to the
normalization of the similarity matrix, one similarity score $x_{ij}$ can change only if other
similarity scores change accordingly. This introduces an additional interdependence
between similarity scores that is not a result of the topology of the two graphs alone.
It means that similarity scores can only reflect similarity of nodes of
two graphs relative to each other. We can't conclude whether two nodes are similar,
but only whether one pair of nodes is more similar than some other pair of nodes.

Consider the following special case. Suppose that all the nodes of one graph 
are equally similar to all the nodes of the second graph. In a normalized matrix 
it is impossible that all the similarity scores are equal to 0, or that all the 
similarity scores are equal to 1. Because of the normalization constraint, we 
can't differentiate between all possible levels of similarity. All we can say is that 
the nodes of one graph are equally similar to all the nodes of the second graph, 
but not how much. 

It would be preferable for similarity scores to represent not the relative
magnitudes of similarities of pairs of nodes but, in a way, "absolute" magnitudes,
with the possibility of all scores having 0 or the maximal value.

The lack of this property also makes it harder to use similarity scores of
the nodes to construct a similarity measure of whole graphs. Heymans and
Singh [5] were able to achieve this because they use similarity scores that can be
negative (as a consequence of subtracting the dissimilarity scores that they use),
but as discussed in the previous special case, it would not be possible with other
methods.

If two nodes don't have ingoing or outgoing edges, they should be considered sim- 
ilar. To our knowledge, this property is present only in the method of Heymans 
and Singh. We believe that concepts of in-similarity and out-similarity should 
be recognized. Moreover, in-similarity and out-similarity should be 1 if there 
are no in-neighbors or out-neighbors. 

4. Method of Neighbor Matching 

In this section we refine the notion of node similarity. Based on that refine- 
ment, we describe a new method (we call this method the method of neighbor 
matching) for measuring similarity of nodes of graphs and prove its properties. 
Then, we define a measure of similarity of whole graphs based on the similarities 
of their nodes. 

4.1. Notion of Similarity of Graph Nodes

In the existing methods, the calculation of similarity $x_{ij}$ is based on adding or
averaging the similarities of all the neighbors of node $i \in V_A$ to all the neighbors
of node $j \in V_B$. We propose a modification to that approach, illustrated by
the following intuition. We perceive our two hands to be very similar, not
because all the fingers of the left hand are very similar to all the fingers of
the right hand, but rather because to each finger of the
left hand corresponds one finger of the right hand that is very similar to it.
By analogy, the concept of similarity can be refined: two nodes $i \in V_A$ and
$j \in V_B$ are considered to be similar if the neighbor nodes of $i$ can be matched to
similar neighbor nodes of $j$ (hence the name neighbor matching).

4.2. Measuring Similarity of Graph Nodes

As in other related methods, similarity scores are calculated as the fixed
point of the iterative procedure defined by some update rule. In our method,
we differentiate between in-similarity $s_{in}$ and out-similarity $s_{out}$ and
give them equal weights. In order to calculate in-similarity, the matching of
in-neighbors with the maximal sum of similarities (as described in Section 2) has to
be constructed, and analogously for out-similarity. More formally, the update
rule is given by

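\[
x_{ij}^{k+1} \leftarrow \frac{s_{in}^{k+1}(i,j) + s_{out}^{k+1}(i,j)}{2}.
\]

In and out similarities are defined by

\[
s_{in}^{k+1}(i,j) \leftarrow \frac{1}{m_{in}} \sum_{l=1}^{n_{in}} x^{k}_{f_{ij}^{in}(l)\, g_{ij}^{in}(l)},
\qquad
s_{out}^{k+1}(i,j) \leftarrow \frac{1}{m_{out}} \sum_{l=1}^{n_{out}} x^{k}_{f_{ij}^{out}(l)\, g_{ij}^{out}(l)},
\tag{1}
\]

\[
m_{in} = \max(id(i), id(j)), \qquad m_{out} = \max(od(i), od(j)),
\]
\[
n_{in} = \min(id(i), id(j)), \qquad n_{out} = \min(od(i), od(j)),
\]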

where the functions $f_{ij}^{in}$ and $g_{ij}^{in}$ are the enumeration functions of the optimal
matching of in-neighbors of nodes $i$ and $j$ with weight function $w(a, b) = x_{ab}^{k}$, and
analogously for $f_{ij}^{out}$ and $g_{ij}^{out}$. In equation (1) we define $\frac{0}{0}$ to be 1. This
convention ensures that the similarity of nodes with no in-neighbors or no
out-neighbors is recognized. If there is a difference in the number of in-neighbors or
out-neighbors, that difference is penalized when calculating the corresponding
similarities, since $m_{in}$ and $m_{out}$ are greater than the number of
terms in the summation (each of which is less than or equal to 1, as we show later).
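
The one-sided similarity in equation (1) can be sketched as follows. This Python sketch uses a brute-force search over matchings instead of the Hungarian algorithm, so it is only practical for nodes with a handful of neighbors; the function name `side_similarity` and the dictionary `x` holding the scores of the previous iteration are assumptions made for the example.

```python
from itertools import permutations


def side_similarity(nbrs_i, nbrs_j, x):
    """Weight of the optimal neighbor matching divided by max(#nbrs_i, #nbrs_j).

    nbrs_i, nbrs_j: lists of in-neighbors (or out-neighbors) of i and of j.
    x: dict {(a, b): similarity of a and b from the previous iteration}.
    """
    m = max(len(nbrs_i), len(nbrs_j))
    if m == 0:
        return 1.0                     # convention 0/0 = 1
    if len(nbrs_i) <= len(nbrs_j):
        sums = (sum(x[(a, b)] for a, b in zip(nbrs_i, image))
                for image in permutations(nbrs_j, len(nbrs_i)))
    else:
        sums = (sum(x[(a, b)] for a, b in zip(image, nbrs_j))
                for image in permutations(nbrs_i, len(nbrs_j)))
    return max(sums) / m


# Example: i has two in-neighbors, j has three.
x = {("a", "p"): 1.0, ("a", "q"): 0.5, ("a", "r"): 0.0,
     ("b", "p"): 0.8, ("b", "q"): 0.2, ("b", "r"): 0.1}
s_in = side_similarity(["a", "b"], ["p", "q", "r"], x)
```

In the example the best matching pairs "a" with "q" and "b" with "p" (weight 1.3), and the result is divided by 3 because the larger neighborhood has three nodes, which is how the difference in neighborhood size is penalized.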

This method is easily extended to colored graphs. By definition, we can set
$x_{ij}^{k}$ to be 0 if nodes $i$ and $j$ are of different color.

As in other iterative methods, one has to choose the initial similarity scores
$x_{ij}^{0}$. In our method, we set $x_{ij}^{0} = 1$ for all $i \in V_A$, $j \in V_B$. Though the choice
may seem arbitrary, note that in the first iteration it leads to intuitive results:

\[
s_{in}^{1}(i,j) = \frac{\min(id(i), id(j))}{\max(id(i), id(j))}, \qquad
s_{out}^{1}(i,j) = \frac{\min(od(i), od(j))}{\max(od(i), od(j))}.
\]

If, for instance, a node $i$ has 3 in-neighbors and a node $j$ has 5 in-neighbors,
the in-similarity of nodes $i$ and $j$ in the first iteration will be $\frac{3}{5}$. We find that
to be an intuitive choice if we don't yet know anything about the similarities
of the neighbor nodes: in that case we can only reason about the number of
neighbor nodes.

The termination condition is $\max_{i,j} |x_{ij}^{k} - x_{ij}^{k-1}| < \varepsilon$ for some chosen precision
$\varepsilon$. Alternative termination conditions could be used too.
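
Putting the update rule, the initialization and the termination condition together gives the following minimal Python sketch of the whole iteration. It is not the implementation used in the paper: the matching is found by brute force (exponential, for tiny graphs only), and the names `neighbor_matching`, `_neighbors`, `_side` as well as the safety cap `max_iter` are assumptions introduced for this example.

```python
from itertools import permutations


def _neighbors(nodes, edges):
    ins = {v: [] for v in nodes}
    outs = {v: [] for v in nodes}
    for s, t in edges:
        outs[s].append(t)
        ins[t].append(s)
    return ins, outs


def _side(na, nb, x):
    """Optimal matching weight of neighbor lists na (graph A) and nb (graph B),
    divided by the larger list length; 0/0 is taken as 1."""
    m = max(len(na), len(nb))
    if m == 0:
        return 1.0
    if len(na) <= len(nb):
        sums = (sum(x[a, b] for a, b in zip(na, img))
                for img in permutations(nb, len(na)))
    else:
        sums = (sum(x[a, b] for a, b in zip(img, nb))
                for img in permutations(na, len(nb)))
    return max(sums) / m


def neighbor_matching(nodes_a, edges_a, nodes_b, edges_b, eps=1e-4, max_iter=1000):
    in_a, out_a = _neighbors(nodes_a, edges_a)
    in_b, out_b = _neighbors(nodes_b, edges_b)
    x = {(i, j): 1.0 for i in nodes_a for j in nodes_b}      # x^0_ij = 1
    for _ in range(max_iter):                                # max_iter: assumed cap
        new = {(i, j): (_side(in_a[i], in_b[j], x)
                        + _side(out_a[i], out_b[j], x)) / 2
               for i in nodes_a for j in nodes_b}
        delta = max((abs(new[k] - x[k]) for k in x), default=0.0)
        x = new
        if delta < eps:                                      # termination condition
            break
    return x


# Example: two isomorphic directed paths, 1 -> 2 -> 3 and a -> b -> c.
x = neighbor_matching([1, 2, 3], [(1, 2), (2, 3)],
                      ["a", "b", "c"], [("a", "b"), ("b", "c")])
```

For these two paths the scores of corresponding nodes, such as `x[(1, "a")]`, stay at 1, while the scores of non-corresponding pairs shrink toward 0 with each iteration.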

Figure 1: Two example graphs given by Zager [4].

        1B      2B      3B      4B      5B      6B
1A    0.682   0.100   0.597   0.200   0.000   0.000
2A    0.000   0.364   0.045   0.195   0.400   0.000
3A    0.000   0.000   0.000   0.091   0.091   0.700

Table 1: Similarity scores for graphs given in Figure 1, calculated using the method of neighbor
matching for $\varepsilon = 10^{-4}$.



Note that our method has a computationally more complex update rule than
previous methods. Other methods include summation of $id(i) \, id(j)$ terms in total
for in-neighbors and $od(i) \, od(j)$ terms in total for out-neighbors. In our
method, we have to solve the assignment problem for $id(i)$ and $id(j)$ in-neighbors
and for $od(i)$ and $od(j)$ out-neighbors. Since efficient algorithms for the
assignment problem (mentioned in Section 2) exist, its complexity should not be of
great practical importance. Also, as will be discussed in Section 5, for practical
purposes, in the case of dense graphs, one could switch to complement graphs
(which are sparse in this case) and so reduce the computation time.

Example 1. In order to illustrate our method, we applied it to the example graphs
(shown in Figure 1) used by Zager [4]. The similarity scores for the nodes of
the graphs are presented in Table 1.

The proposed method converges, as stated by the following theorem. 

Theorem 1. For any choice of graphs $G_A$ and $G_B$, for each pair of nodes
$i \in V_A$ and $j \in V_B$, there exists $x_{ij} = \lim_{k \to \infty} x_{ij}^{k}$ with a value in the range $[0, 1]$.

Proof. For any $i \in V_A$ and $j \in V_B$, the corresponding sequence $(x_{ij}^{k})_{k=0}^{\infty}$ is
nonincreasing. We will prove this by induction on the number of iterations $k$.

The initial similarity score $x_{ij}^{0}$ for any $i$ and $j$ is equal to 1. The weight of
the optimal matching when calculating in or out similarity is equal to $n_{in}$ or $n_{out}$,
respectively, since the weight of matching any two nodes is 1. Since $m_{in} \ge n_{in}$
and $m_{out} \ge n_{out}$, it holds that $s_{in}^{1}(i,j) = \frac{n_{in}}{m_{in}} \le 1$ and
$s_{out}^{1}(i,j) = \frac{n_{out}}{m_{out}} \le 1$, and
the same holds for $x_{ij}^{1}$, being the arithmetic mean of the two values. This proves
that in the first step, the similarity scores cannot grow, which is the base of the
induction.

Suppose that up to the step $k$ the sequence of scores $x_{ij}^{k}$ is nonincreasing,
meaning that $x_{ij}^{k} \le x_{ij}^{k-1}$. This means that the weights of matching of
any two nodes when calculating $s_{in}^{k+1}$ and $s_{out}^{k+1}$ are not greater than the weights
when calculating $s_{in}^{k}$ and $s_{out}^{k}$, and thus $s_{in}^{k+1} \le s_{in}^{k}$ and $s_{out}^{k+1} \le s_{out}^{k}$. We show
this for in-similarity; the reasoning for out-similarity is analogous. Let $f_{ij}^{k}$
and $g_{ij}^{k}$ be the enumeration functions of the optimal matching of in-neighbors
of nodes $i \in V_A$ and $j \in V_B$ in iteration $k$ (that is, the matching that is optimal
for the weights $x^{k-1}$). Then, it holds that

\[
\sum_{l=1}^{n_{in}} x^{k}_{f_{ij}^{k+1}(l)\, g_{ij}^{k+1}(l)}
\le \sum_{l=1}^{n_{in}} x^{k-1}_{f_{ij}^{k+1}(l)\, g_{ij}^{k+1}(l)}
\le \sum_{l=1}^{n_{in}} x^{k-1}_{f_{ij}^{k}(l)\, g_{ij}^{k}(l)}.
\]

The first inequality holds by the inductive hypothesis, and the second by the
optimality of the matching defined by $f_{ij}^{k}$ and $g_{ij}^{k}$ in iteration $k$. Dividing all
three expressions by $m_{in}$, we conclude $s_{in}^{k+1}(i,j) \le s_{in}^{k}(i,j)$. The same holds for
out-similarities. Consequently, we have $x_{ij}^{k+1} \le x_{ij}^{k}$. This proves the inductive
step. Hence, the sequence of similarity scores $(x_{ij}^{k})_{k=0}^{\infty}$ is nonincreasing.

By induction on the number of iterations we prove that in all the iterations, 
all the similarity scores are nonnegative. In the first iteration, all the scores are 
nonnegative. In each subsequent iteration, the update rule consists of averaging 
some of the scores from the previous iteration. By averaging nonnegative values 
one cannot obtain a negative value, so each sequence of similarity scores is 
nonnegative and thus bounded from below by zero. A nonincreasing sequence
bounded from below must have a limit, so $x_{ij} = \lim_{k \to \infty} x_{ij}^{k}$ exists. Since the
sequence is nonincreasing and $x_{ij}^{0} = 1$, the limit can't be greater than 1. Also,
since all the elements are nonnegative, the limit also has to be nonnegative.
This proves the theorem. □

Simple examples can be produced to show that the bounding interval [0, 1] 
is tight. 

An important property of the similarity for isomorphic graphs is established by
the following theorem.

Theorem 2. For two isomorphic graphs $G_A$ and $G_B$, let $f : V_A \to V_B$ be an
isomorphism between the two graphs. For each node $i \in V_A$, it holds that $x_{i f(i)} = 1$.

Proof. We show that $x_{i f(i)}^{k} = 1$ for all $i \in V_A$ and all $k \ge 0$ by induction on the
number of iterations $k$.

The initial value $x_{i f(i)}^{0}$ is equal to 1 for all $i \in V_A$, by definition. This is
the base of the induction. Let $k \ge 0$, assume $x_{i f(i)}^{k} = 1$ for all $i \in V_A$, and
consider $x_{i f(i)}^{k+1}$. Since $f$ is an isomorphism of two graphs, nodes $i$ and $f(i)$ must
have the same number of in-neighbors and out-neighbors. Hence, $m_{in} = n_{in}$
and $m_{out} = n_{out}$. It suffices to prove that the weights of the optimal matchings
when calculating in and out similarity are equal to $n_{in}$ and $n_{out}$ respectively.
We discuss in-similarity first. Since $f$ is an isomorphism, it maps all the
in-neighbors of node $i$ to in-neighbors of node $f(i)$. The weights $x_{a f(a)}^{k}$ of matching
each in-neighbor $a$ of $i$ to the in-neighbor $f(a)$ of $f(i)$ are equal to 1 by the inductive
hypothesis, thus being maximal. So the matching of each in-neighbor $a$ of $i$ to the
in-neighbor $f(a)$ of $f(i)$ is optimal. Since there are $n_{in}$ in-neighbors, the weight
of the optimal matching of in-neighbors is $n_{in}$. Analogous reasoning is used to
show that the weight of the optimal matching of out-neighbors is equal to $n_{out}$.
Therefore, both in and out similarity of $i$ and $f(i)$ in step $k + 1$ are equal to 1
for all $i \in V_A$ and so, the similarity score $x_{i f(i)}^{k+1}$ is also equal to 1 for all $i \in V_A$.

Since $x_{i f(i)}^{k} = 1$ for all $k \ge 0$ and $i \in V_A$, the limit $x_{i f(i)} = \lim_{k \to \infty} x_{i f(i)}^{k}$
is also 1 for all $i \in V_A$. □

In the case $G_A = G_B$, where $f$ is the trivial automorphism $f(i) = i$ for all
$i \in V_A$, this theorem implies a simple corollary.

Corollary 1. For any graph $G_A$ and each node $i \in V_A$, it holds that $x_{ii} = 1$.

It is easy to check that the proven theorems hold for colored graphs too. 

By the above statements, the neighbor matching method fulfills the first two
requirements listed in Section 3. The matrix normalization is avoided, and it
is easy to produce examples of graphs with all the similarity values being 0 or
all the similarity values being 1. Similarity of nodes due to lack of in or out
neighbors is recognized because in that case in or out similarity will be equal to
1. So, we can conclude that all the requirements listed in Section 3 are met.

4.3. Measuring Similarity of Graphs

The method of neighbor matching can be used to construct a similarity
measure of two graphs in the way of Heymans and Singh [2]. When the similarity
scores $x_{ij}$ for graphs $G_A$ and $G_B$ are computed, the optimal matching between
their nodes can be found by solving the assignment problem between the nodes
from $V_A$ and $V_B$, with the weight of matching two nodes being the similarity of
the nodes. Let $f$ and $g$ be the enumeration functions for the optimal matching and
$n = \min(|V_A|, |V_B|)$. Then, the similarity of graphs $G_A$ and $G_B$ can be computed
by

\[
s(G_A, G_B) = \frac{1}{n} \sum_{l=1}^{n} x_{f(l) g(l)}. \tag{2}
\]

By Theorem 1, the value of the similarity measure $s$ is bounded in the interval
$[0, 1]$. As a simple corollary of Theorem 2, if $G_A$ and $G_B$ are isomorphic, it holds that
$s(G_A, G_B) = 1$.
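
Equation (2) can be illustrated with a short Python sketch that takes an already computed node similarity matrix and returns the graph similarity. As before, the optimal matching is found by brute force, which is an assumption made to keep the example self-contained and is only feasible for very small graphs; the function name `graph_similarity` is hypothetical.

```python
from itertools import permutations


def graph_similarity(x, nodes_a, nodes_b):
    """Weight of the optimal node matching divided by min(|V_A|, |V_B|).

    x: dict {(i, j): similarity of node i of G_A and node j of G_B}.
    """
    n = min(len(nodes_a), len(nodes_b))
    if n == 0:
        raise ValueError("both graphs need at least one node")
    if len(nodes_a) <= len(nodes_b):
        sums = (sum(x[a, b] for a, b in zip(nodes_a, img))
                for img in permutations(nodes_b, n))
    else:
        sums = (sum(x[a, b] for a, b in zip(img, nodes_b))
                for img in permutations(nodes_a, n))
    return max(sums) / n


# Example with made-up node similarities for a 2-node and a 3-node graph.
x = {("a1", "b1"): 0.9, ("a1", "b2"): 0.8, ("a1", "b3"): 0.1,
     ("a2", "b1"): 0.7, ("a2", "b2"): 0.2, ("a2", "b3"): 0.3}
s = graph_similarity(x, ["a1", "a2"], ["b1", "b2", "b3"])
```

With these illustrative scores the optimal matching is a1 with b2 and a2 with b1, so the graph similarity is (0.8 + 0.7) / 2 = 0.75.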

Of course, different similarity measures for graphs could be constructed based
on the similarities of their nodes. For instance, the sum of weights of the optimal
matching could be divided by $\max(|V_A|, |V_B|)$ instead of $\min(|V_A|, |V_B|)$. Such a
choice would penalize the difference in size when comparing two graphs. Another
interesting choice would be to take the average of all the values in the similarity
matrix. In such a case, graphs with a greater number of automorphisms would be
considered to be more self-similar than graphs without automorphisms. In the
rest of the paper we will use the measure defined by equation (2).

5. Experimental Evaluation 

We implemented the method of neighbor matching and the methods of
Zager and Verghese and of Heymans and Singh in C++ (see footnotes 1 and 2). For solving the
assignment problem, we used an available implementation of the Hungarian
algorithm [13]. Nevertheless, more efficient algorithms (mentioned in Section 2)
exist.

In this section, we describe two experiments we performed to test the per- 
formance of our method. The first one was related to the matching of the 
isomorphic subgraph, and the second one was the classification of the Boolean 
formulae. 

5.1. Isomorphic Subgraph Matching 

Here we present a slightly modified experiment from Zager and Verghese [3],
which we use to compare several methods for computing node similarity. We
consider the problem of finding a subgraph of a graph $A$ that is isomorphic
to some other graph $B$. We use random Erdős-Rényi graphs $G_{n,p}$. The
experiment consists of generating a random graph $A$ of size $n$ and randomly
selecting $m \le n$ nodes, which induce a subgraph $B$ of $A$. The similarity of nodes
of $A$ and $B$ is calculated, the assignment problem between the nodes of $A$ and
$B$ is solved, and the matching of the nodes is obtained. Then, it is checked whether
graph $B$ is isomorphic to the subgraph of $A$ induced by the obtained matching.
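
The data generation and the final check of this experiment can be sketched in Python as follows. Several details are assumptions made for the example and are not stated in the text: the random graphs are taken to be directed without self-loops, and the check tests whether the obtained matching itself is an isomorphism, which is a stricter test than checking for the existence of any isomorphism. All function names are hypothetical.

```python
import random


def random_digraph(n, p, rng):
    """Directed G(n, p): each ordered pair (i, j), i != j, is an edge with probability p."""
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and rng.random() < p}


def induced_subgraph(edges, selected):
    """Induced subgraph on `selected`, relabelled 0..m-1 in the given order."""
    index = {v: k for k, v in enumerate(selected)}
    return {(index[i], index[j]) for (i, j) in edges
            if i in index and j in index}


def matching_is_isomorphism(edges_a, edges_b, match, m):
    """match[k] is the node of A assigned to node k of B (k = 0..m-1).

    Returns True if the matching is injective and preserves both edges and
    non-edges, i.e. it is an isomorphism onto the induced subgraph of A.
    """
    if len({match[k] for k in range(m)}) != m:
        return False
    return all(((k, l) in edges_b) == ((match[k], match[l]) in edges_a)
               for k in range(m) for l in range(m) if k != l)


rng = random.Random(0)
A = random_digraph(15, 0.4, rng)
selected = rng.sample(range(15), 8)
B = induced_subgraph(A, selected)
```

By construction, the list `selected` is itself a correct matching (node `k` of B corresponds to node `selected[k]` of A), which gives a convenient sanity check for the test function before plugging in a matching obtained from similarity scores.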

For $n = 15$, this procedure is repeated 500 times for each pair of $m =
8, 9, \ldots, 15$ and $p = 0.2, 0.4, 0.6, 0.8$, and the accuracy of the method (the
percentage of correct guesses) is calculated for each pair. The required numeric
precision when calculating similarities for all the methods was $\varepsilon = 10^{-4}$, and the
same termination condition was used: $\max_{i,j} |x_{ij}^{k} - x_{ij}^{k-1}| < \varepsilon$.

The methods compared were the method of neighbor matching (NM), the
one of Heymans and Singh (HS), and the one of Zager and Verghese (ZV). It was
noted that the NM and ZV methods are heavily influenced by the density parameter
$p$, both in matching performance and speed, while the HS method is not. We
believe that this is due to the fact that the HS method considers both the
input graphs and their complements. As suggested in Section 4, we made a
modification to the other two methods, which we call "the complement trick": for
dense graphs ($p > 0.5$), the similarity of nodes is measured for the complement
graphs instead of the original input graphs (see footnote 3). This introduced the
methods NM* and ZV*. For completeness of the evaluation, we introduced HS*, too.

1 The source code of the implementation of the neighbor matching method is available from
http://www.matf.bg.ac.rs/~nikolic/software.html

2 The C++ implementation of the method of Heymans and Singh was obtained by a simple
transformation of a Java implementation kindly provided by Ambuj Singh.

For each method and each value of the parameter $p$, we present one plot that
shows the percentage of successes in isomorphic subgraph matching for each
value of $m$. The plots are presented in Figures 2, 3, and 4. It can be noted
that the accuracy of ZV and ZV* generally rises much later than for the other
methods. NM* clearly performs best.

In Table 2, for each method, we present the overall accuracy in the experiment
and the total time spent for the experiment.

               NM      NM*     HS       HS*      ZV     ZV*
Accuracy (%)   27.3    37.8    17.5     17.5     13.9   15.0
Time           2062s   838s    11511s   11730s   349s   230s

Table 2: Overall accuracy and time needed for the experiment, for each method used.

The complement trick clearly improved the NM and ZV methods. As expected,
it did not affect the HS method. For the NM* and ZV* methods, apart
from boosting the accuracy, the computation time is significantly reduced. For
the NM method, this modification reduces the computation time for solving the
assignment problem in the NM update rule, since it reduces the number of nodes
to be matched in the cases when this number can be large (dense graphs).
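
The complement trick itself is simple to express. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes directed graphs without self-loops. The switch is driven by the known density parameter $p$ and applied to both graphs at once, so that the subgraph relation between them is preserved.

```python
def complement(nodes, edges):
    """Complement of a directed graph without self-loops."""
    edges = set(edges)
    return {(i, j) for i in nodes for j in nodes
            if i != j and (i, j) not in edges}


def complement_trick(nodes_a, edges_a, nodes_b, edges_b, p):
    """For dense graphs (p > 0.5), compare the complements instead of the originals."""
    if p > 0.5:
        return complement(nodes_a, edges_a), complement(nodes_b, edges_b)
    return set(edges_a), set(edges_b)


# Example: a 3-node graph with edges 0 -> 1 and 1 -> 2.
nodes = [0, 1, 2]
edges = {(0, 1), (1, 2)}
co = complement(nodes, edges)
```

The node similarity computation is then run unchanged on the returned edge sets; for dense inputs these are sparse, which keeps the neighbor lists to be matched short.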

5.2. The Classification of Boolean Formulae 

Here we present the problem of the classification of Boolean formulae which 
we use to show that our method can capture a meaningful similarity in a real 
world problem. 

Various important practical problems can be modeled in Boolean logic in- 
cluding problems in electronic design automation, software and hardware verifi- 
cation, scheduling, timetabling, artificial intelligence, and other domains. Each 
instance of the problem is represented by a Boolean formula. Classification of
Boolean formulae has been investigated in order to automatically tune SAT
solvers (systems for checking the satisfiability of Boolean formulae), which is a
practically important and challenging problem. A very reliable approach to
Boolean formulae classification is based on measuring the distances between the 
formulae [H]. In that approach, in order to compute the distance between the 
formulae, they are represented by numerical vectors of some syntactical fea- 
tures, that can be computed for each formula. However, Boolean formulae have 
a natural variable-clause graph representation [TJ that could be used for their 
classification. 



3 The complement trick could be given an intuitive rationale. For instance, consider one 
trying to reason about similarity of two sparse graphs based on their adjacency matrices. 
Probably, one would spot ones in the matrices and analyze their arrangements in some way. 
If the graphs were dense it would be much easier to spot zeroes and reason about them. 

[Figures 2-4: plots; horizontal axis: subgraph size (m)]

Figure 2: Accuracy of isomorphic subgraph matching for NM and NM* methods.

Figure 3: Accuracy of isomorphic subgraph matching for HS and HS* methods.

Figure 4: Accuracy of isomorphic subgraph matching for ZV and ZV* methods.

We performed the classification of Boolean formulae using our similarity
measure for graphs on their graph representation. We used 149 structured
instances from the SAT competition 2002 benchmark set (which is one of the standard
benchmark sets for SAT; see the footnote on benchmark availability). Most of the
formulae had up to 1000 nodes, but 25 of them were larger (up to 5280 nodes).
Formulae were grouped in 9 classes corresponding to the problems the formulae
originate from. Graphs corresponding to the formulae had from 122 to 5280 nodes.
Differences in graph size of an order of magnitude were present within each class
too. The classification was performed using the k nearest neighbors algorithm with
a leave-one-out evaluation procedure: for each formula $F$, its graph similarity to
the remaining formulae was computed, and the set $N(k)$ of the $k$ most similar
formulae was determined. Formula $F$ is classified to the class that has the most
representatives in the set $N(k)$. For the evaluation of the classification
performance, we measured the accuracy of the classification: the number of correctly
classified formulae divided by the total number of formulae being classified.
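
The evaluation procedure described above can be sketched in Python as follows. The function name `loo_knn_accuracy` and the nested-list representation of the similarity matrix are assumptions for the example. The text does not say how ties between classes are broken; in this sketch a tie goes to the class that appears first among the most similar neighbors, which is likewise an assumption.

```python
from collections import Counter


def loo_knn_accuracy(sim, labels, k):
    """Leave-one-out accuracy of k nearest neighbors under a similarity measure.

    sim[a][b]: similarity of items a and b (higher means more similar).
    labels[a]: class of item a.
    """
    n = len(labels)
    correct = 0
    for a in range(n):
        # All other items, most similar first.
        others = sorted((b for b in range(n) if b != a),
                        key=lambda b: sim[a][b], reverse=True)
        votes = Counter(labels[b] for b in others[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[a]
    return correct / n


# Toy example with two classes of two items each.
labels = ["x", "x", "y", "y"]
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
acc_k1 = loo_knn_accuracy(sim, labels, 1)
acc_k3 = loo_knn_accuracy(sim, labels, 3)
```

In this toy example, `k = 1` classifies every item correctly, while `k = 3` outvotes the single same-class neighbor every time, which shows why the choice of $k$ matters when classes are small.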

The best accuracy of the classification was 93% for $k = 7$. The best accuracy
for a domain specific approach from [2] on the same set is 96% for $k = 1$. Only
slightly more accurate, the domain specific approach is based on long lasting 
research in the field [HI [TBI 03] • It is interesting to see that the general approach, 
not designed specifically for this purpose, can achieve a very high accuracy. Most 
importantly, we confirmed that our similarity measure can capture a meaningful 
similarity in a real world problem. 

An interesting observation concerning this experiment is that the difference in
size of the compared graphs did not affect the adequacy of the similarity
measure. This kind of robustness might be useful for practical applications.

6. Conclusions and Future Work 

We proposed a refined notion of similarity of graph nodes, and based on 
that refinement we developed a new iterative method for measuring similarity 
of nodes of two graphs. This method was extended to a method for measuring 
similarity of whole graphs. We proved the convergence of the method and 
showed that it has several desirable properties (listed in Section 3) that, to our
knowledge, the existing methods lack. 

We implemented the method and evaluated the implementation on two test 
problems. On one test problem (the isomorphic subgraph matching problem), 
we confirmed that the proposed method performs better than other methods. 
On the second one, we confirmed that the graph similarity measure is able to
capture a meaningful similarity in a real world problem. The method proved to
be robust to differences in graph size. The performance on dense graphs can be
significantly boosted by measuring the similarity of nodes of complement graphs.
This modification can also significantly reduce the running time of the method.



*The benchmarks are available from http://www.satcompetition.org 

As for future work, we are planning applications of the neighbor matching
method in real-world problems in bioinformatics, text classification, and other
domains suitable for graph similarity techniques.